Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition

Authors

Abstract

Transformer-based methods have recently achieved great advancement on 2D image-based vision tasks. For 3D video-based tasks such as action recognition, however, directly applying spatiotemporal transformers to video data brings heavy computation and memory burdens, due to the greatly increased number of patches and the quadratic complexity of self-attention computation. How to efficiently and effectively model the 3D self-attention of video data has been a challenge for transformers. In this paper, we propose a Temporal Patch Shift (TPS) method for efficient 3D self-attention modeling in transformers for video-based action recognition. TPS shifts part of the patches with a specific mosaic pattern in the temporal dimension, thus converting a vanilla spatial self-attention operation into a spatiotemporal one with little additional cost. As a result, we can compute 3D self-attention using nearly the same computation and memory cost as 2D self-attention. TPS is a plug-and-play module and can be inserted into existing 2D transformer models to enhance spatiotemporal feature learning. The proposed method achieves competitive performance with state-of-the-art methods on Something-something V1 & V2, Diving-48, and Kinetics400 while being much more efficient in computation and memory cost. The source code of TPS can be found at https://github.com/MartinXM/TPS.
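The shifting idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation (that is in the linked repository). The 3x3 offset pattern, the cyclic wrap-around at the clip boundaries, the projection-free single-head attention, and the shift-back step are all assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def temporal_patch_shift(x, pattern):
    """Replace each patch with the patch at the same spatial position
    taken from another frame, according to a tiled offset pattern.

    x:       (T, H, W, C) array of patch tokens for T frames.
    pattern: (ph, pw) integer array of temporal offsets (illustrative).
    Frames wrap around cyclically, which is an assumption of this sketch.
    """
    T, H, W, C = x.shape
    ph, pw = pattern.shape
    # Tile the small pattern over the full H x W patch grid.
    reps = (-(-H // ph), -(-W // pw))          # ceiling division
    offsets = np.tile(pattern, reps)[:H, :W]   # (H, W)
    # Source frame index for every (t, h, w) position.
    t_idx = (np.arange(T)[:, None, None] + offsets[None]) % T
    h_idx = np.arange(H)[None, :, None]
    w_idx = np.arange(W)[None, None, :]
    return x[t_idx, h_idx, w_idx]              # (T, H, W, C)

def spatial_self_attention(x):
    """Plain per-frame self-attention with q = k = v = x (no learned weights)."""
    T, H, W, C = x.shape
    tokens = x.reshape(T, H * W, C)
    # One (N, N) score matrix per frame, N = H * W: the 2D attention cost.
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(C)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return (weights @ tokens).reshape(T, H, W, C)

# Hypothetical offset pattern: 0 keeps the patch, +1/-1 borrow from a neighbor frame.
PATTERN = np.array([[0, 1, 0],
                    [-1, 0, 1],
                    [0, -1, 0]])

rng = np.random.default_rng(0)
video = rng.standard_normal((8, 6, 6, 16))     # 8 frames, 6x6 patches, 16 channels

shifted = temporal_patch_shift(video, PATTERN)       # each frame now mixes patches from nearby frames
attended = spatial_self_attention(shifted)           # spatial attention now sees several frames
restored = temporal_patch_shift(attended, -PATTERN)  # move patches back to their own frames
```

In this sketch the attention still runs over `H * W` tokens per frame, so the score matrices keep the size they would have for 2D attention, while every frame's token set contains patches from neighboring frames.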


Similar resources

Attention-based Temporal Weighted Convolutional Neural Network for Action Recognition

Research in human action recognition has accelerated significantly since the introduction of powerful machine learning tools such as Convolutional Neural Networks (CNNs). However, effective and efficient methods for incorporation of temporal information into CNNs are still being actively explored in the recent literature. Motivated by the popular recurrent attention models in the research area ...


Pose-conditioned Spatio-Temporal Attention for Human Action Recognition

We address human action recognition from multi-modal video data involving articulated pose and RGB frames and propose a two-stream approach. The pose stream is processed with a convolutional model taking as input a 3D tensor holding data from a sub-sequence. A specific joint ordering, which respects the topology of the human body, ensures that different convolutional layers correspond to meanin...


Spatiotemporal Residual Networks for Video Action Recognition

Two-stream Convolutional Networks (ConvNets) have shown strong performance for human action recognition in videos. Recently, Residual Networks (ResNets) have arisen as a new technique to train extremely deep architectures. In this paper, we introduce spatiotemporal ResNets as a combination of these two approaches. Our novel architecture generalizes ResNets for the spatiotemporal domain by intro...


Attention shift decoding for conversational speech recognition

We introduce a novel approach to decoding in speech recognition (termed attention-shift decoding) that attempts to mimic aspects of human speech recognition responsible for robustness in processing conversational speech. Our approach is a radical departure from traditional decoding algorithms for speech recognition. We propose a method to first identify reliable regions of the speech signal and...


Temporal Modeling of Spatiotemporal Networks

A spatiotemporal network is a spatial network (e.g., road network) along with the corresponding time-dependent travel time for each segment of the network. Design and analysis of policies and plans on spatiotemporal networks (e.g., for path planning with location-based services) require realistic models that accurately represent the temporal behavior of such networks. In this paper, for the firs...



Journal

Journal title: Lecture Notes in Computer Science

Year: 2022

ISSN: 1611-3349, 0302-9743

DOI: https://doi.org/10.1007/978-3-031-20062-5_36